Detect and repair grounding-file drift after schema changes by daniel-thom · Pull Request #72 · dsgrid/datasight

daniel-thom · 2026-05-11T00:41:18Z

Summary

Catches the failure mode where queries.yaml, schema_description.md, and time_series.yaml fall out of sync with the live database — typically after a datasight tidy review reshape. When that happens the LLM agent is silently trained on wrong column names: it hallucinates, refuses, or returns tables full of zeros despite a clean execution. Discovered while benchmarking local Ollama models against a wide→long reshape: every model produced 0/5 useful answers until the grounding files were rewritten by hand.

Two pieces, both opt-in at user-controlled checkpoints:

Plan 1 — `datasight verify` static drift check (no LLM)

New datasight.grounding library: AST-walks queries.yaml with sqlglot, scans backticked identifiers in schema_description.md, and validates time_series.yaml entries against the live schema. Suggests nearest matches by edit distance.
Auto-loads distinct values of low-cardinality VARCHAR columns to suppress false positives from prose like Values: `pacific`, `mountain`, ….
Wired into datasight verify as a cheap pre-flight check that runs before the LLM phase. New --static-only flag for fast drift-only checks; --skip-grounding-check disables it.

Plan 2 — `datasight tidy review` post-apply repair hook (LLM, opt-in)

New datasight.grounding_repair library: snapshots schema before the tidy transform, calls the configured LLM with both old and new schemas plus the drift report, validates every rewritten SQL against the live DB (retries up to 2x on failures), and returns proposed file contents without writing.
Hook in tidy review runs after apply_proposal completes: shows drift, prompts to repair, displays a unified diff, prompts to apply, and writes atomically via tempfile + os.replace.
Scope deliberately limited to the three LLM-grounding files. schema.yaml and measures.yaml remain owned by tidy_review's existing update helpers.

Also included: per-call token-rate reporting

The first commit on this branch threads wall-clock timing through the agent loop and surfaces it in the QUERY TOTAL log line as elapsed=X.XXs out_tps=Y.Y. Provider-agnostic — works for Ollama (whose OpenAI-compat endpoint strips native timing fields), Anthropic, OpenAI, and GitHub Models alike. Helped quantify the impact of the grounding fix during benchmarking.

Test plan

pytest tests/test_grounding.py — 20 unit tests covering drift detection, CTE/output-alias false positives, qualified table.column resolution, enum allowlist suppression, time_series checks, parse errors
pytest tests/test_grounding_repair.py — 18 tests covering JSON parsing (bare/fenced/non-object), atomic writes (validation-error files left untouched), happy-path repair, retry-on-bad-SQL, max-retries-exceeded fallback
pytest -q — full suite 1535 passed, 1 deselected (pre-existing live-LLM test needs qwen2.5:7b Ollama model installed, unrelated)
Smoke test: datasight verify --static-only against a known-drifted project flags the stale references and exits 1; against a clean project reports clean and exits 0
End-to-end manual: run a tidy review reshape on a test project, accept the repair prompt, confirm the rewritten files validate against the new schema

🤖 Generated with Claude Code

Adds elapsed_seconds wall-clock timing on every LLM API call, aggregates it through the agent loop, and surfaces it in the QUERY TOTAL log line and provenance JSON as `elapsed=X.XXs out_tps=Y.Y`. Provider-agnostic — works for Ollama (whose OpenAI-compatible endpoint omits its native timing fields), Anthropic, OpenAI, and GitHub Models alike. The reported rate is `output_tokens / elapsed_wallclock`, which includes prefill + network and is therefore a lower bound on the model's true decode rate; callers that want a clean decode rate should probe the model directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two complementary pieces that catch the failure mode where queries.yaml / schema_description.md / time_series.yaml fall out of sync with the live database (typically after a `datasight tidy review` reshape). When that happens the LLM agent is silently being trained on wrong column names — hallucinating, refusing, or returning tables full of zeros despite a clean execution. Plan 1 — `datasight verify` static drift check - New `datasight.grounding` library: AST-walks queries.yaml with sqlglot, scans backticked identifiers in schema_description.md, and validates time_series.yaml entries against the live schema. - Suppresses false positives from prose enumeration values by auto-loading distinct values of low-cardinality VARCHAR columns. - Wired into `datasight verify` as a cheap pre-flight check. New `--static-only` flag skips the LLM phase for fast drift-only checks; `--skip-grounding-check` disables it. Plan 2 — `datasight tidy review` post-apply repair hook - New `datasight.grounding_repair` library: snapshots schema before the tidy transform, calls the configured LLM with both old and new schemas plus the drift report, validates every rewritten SQL against the live DB (retries up to 2x on failures), and returns proposed file contents without writing. - Hook in `tidy review` runs after `apply_proposal` completes: shows drift, prompts to repair, displays unified diff, prompts to apply, and writes atomically via tempfile + os.replace. - Scope limited to queries.yaml, schema_description.md, and time_series.yaml — schema.yaml and measures.yaml stay owned by tidy_review's existing update helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-11T00:50:20Z

Codecov Report

❌ Patch coverage is 71.35417% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.86%. Comparing base (3a22f8a) to head (bd64229).

Files with missing lines	Patch %	Lines
src/datasight/cli_commands/tidy.py	38.46%	48 Missing ⚠️
src/datasight/grounding_repair.py	79.91%	48 Missing ⚠️
src/datasight/grounding.py	81.53%	41 Missing ⚠️
src/datasight/cli_commands/grounding.py	39.65%	35 Missing ⚠️
src/datasight/web/app.py	67.59%	35 Missing ⚠️
src/datasight/cli_commands/verify.py	64.70%	12 Missing ⚠️
src/datasight/cost.py	93.33%	1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (71.35%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #72      +/-   ##
==========================================
- Coverage   86.84%   85.86%   -0.98%     
==========================================
  Files          61       64       +3     
  Lines       11562    12317     +755     
==========================================
+ Hits        10041    10576     +535     
- Misses       1521     1741     +220

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Adds an "Apple Silicon: MLX-native models" subsection to the choosing-an-llm concepts page with measured numbers from the resource monitor: - qwen2.5:7b (q4_K_M GGUF): ~2 GB resident — only option that fits 16 GB - gemma4:e2b-mlx-bf16: ~11 GB — NOT low-memory despite the "e2b" naming, because the default 256K KV-cache buffer dominates - qwen3.6:35b-a3b-coding-mxfp8: ~38 GB — best answer quality, sparse MoE works well on Apple Silicon's unified memory Tier table makes the recommendation explicit: 16 GB → qwen2.5:7b, 32 GB → qwen2.5:7b or gemma4 (tradeoff: terse vs. richer), 48 GB+ → qwen3.6:35b-A3B. Cross-references added from install.md and configuration.md so users landing on either page get pointed at the measured guidance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…est import - grounding.py: the _DEFAULT_GROUNDING_FILES tuple was a vestige of an earlier design pass where callers iterated over it; the final API takes each path as a keyword argument so the constant is dead code. - tests/test_grounding.py: pytest was imported but the tests use only assert statements (no raises/fixtures from pytest itself). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a static “grounding drift” detector plus an opt-in LLM-based repair flow to keep queries.yaml, schema_description.md, and time_series.yaml aligned with the live DB schema (especially after datasight tidy review reshapes). It also threads per-call wall-clock timing through the LLM stack to report token rates in logs and provenance.

Changes:

Introduces datasight.grounding for static drift detection across grounding files and wires it into datasight verify (--static-only, --skip-grounding-check).
Introduces datasight.grounding_repair for LLM-driven rewrite + validation + atomic write, and hooks it into datasight tidy review post-apply.
Adds client-side elapsed timing to LLM usage/cost reporting and updates docs with expanded local-model guidance.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/test_grounding.py	Unit tests for static grounding drift detection and report formatting.
tests/test_grounding_repair.py	Unit tests for JSON parsing, retry/validation behavior, diffs, and atomic writes in repair flow.
tests/test_cli_tools.py	Updates cost-data tests to include elapsed timing + token rate fields.
src/datasight/llm.py	Measures per-API-call wall-clock time and exposes it via `Usage.elapsed_seconds`.
src/datasight/agent.py	Aggregates elapsed call time across the agent loop into `AgentResult.total_elapsed_seconds`.
src/datasight/cost.py	Adds elapsed + token-rate fields to cost summaries and logs when available.
src/datasight/cli.py	Plumbs elapsed/token-rate fields into logging and provenance.
src/datasight/grounding.py	New static drift checker + schema/enum helpers + drift report formatter.
src/datasight/grounding_repair.py	New LLM-driven repair orchestration, validation, diffing, and atomic write helper.
src/datasight/cli_commands/verify.py	Runs static grounding drift check before LLM verification; adds new CLI flags.
src/datasight/cli_commands/tidy.py	Post-apply drift detection + optional LLM repair prompt and atomic apply.
docs/use/how-to/install.md	Updates Ollama model guidance and Apple Silicon notes.
docs/use/concepts/choosing-an-llm.md	Adds MLX-native Apple Silicon guidance and benchmark table.
docs/reference/configuration.md	Expands `OLLAMA_MODEL` guidance and links to MLX section.
docs/reference/cli.md	Documents new `datasight verify` flags and behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The CLI ``datasight tidy review`` already calls grounding repair after ``apply_proposal`` completes; the web equivalent at ``/api/tidy/apply`` did not. This wires the same flow into the web endpoint: 1. Snapshot the pre-apply schema from ``state.schema_info`` before the transform runs (so the LLM can see both old and new shapes). 2. After the apply + existing schema.yaml/measures.yaml updates and the schema re-introspection, build the post-apply truth set and run the drift check. 3. When drift exists and an LLM client is configured, call ``repair_grounding``, validate each rewritten SQL against the live DB, and atomically write any files that pass validation. Failures are surfaced in the response but never fail the apply itself — the database mutation has already committed. 4. Surface the outcome as ``grounding_repair`` in the JSON response so the UI can tell the user what changed (or why nothing did). The repair runs *before* the schema_text rebuild because ``_load_user_description`` reads schema_description.md from disk; rewriting first ensures the next LLM call sees the repaired grounding. Tests cover the three branches the response shape distinguishes: applied successfully, skipped (no LLM client configured), and surfaced validation errors. ``repair_grounding`` is monkeypatched in tests since the FastAPI startup hook re-initializes ``state.llm_client`` from the environment and would clobber an in-place stub. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The post-apply grounding repair was previously locked behind a synchronous call inside `tidy review` apply (CLI) and /api/tidy/apply (web). When the LLM call timed out — common with large prompts on local models — the user had no way to retry without re-running the database transform. Web requests also blocked on the slow LLM call, which made the apply UX feel unrelated to the actual DB work it had already committed. This commit splits repair from apply on both surfaces and adds the missing retry path: - Persist the pre-apply schema to .datasight/grounding_snapshot.json on every CLI tidy review and web apply. Both surfaces and the new standalone CLI command read it. - New `datasight grounding {check, repair}` group. `repair` reads the snapshot, runs the LLM rewrite, shows a diff, and prompts to apply. `--model` overrides the configured model for retry, `--from-csv` derives the pre-tidy schema from a source CSV when no snapshot exists, `--dry-run` skips the write. - Web /api/tidy/apply now returns immediately after the DB mutation and includes a static drift summary. The slow LLM repair moved to /api/tidy/grounding/repair, called from a banner the drawer shows after apply. The endpoint falls back to the on-disk snapshot when the in-memory copy is gone (post-restart). - New /api/grounding/status + header pill so users see drift any time it exists, not only after they just applied something. - `tidy review` accepts --model so the same flag works on every LLM-using command. Docs: - Per-workload model recommendation in choosing-an-llm.md: tool-use commands (tidy review) favor general qwen3.6, long-form generation commands (grounding repair) favor the coding-mxfp8 variant. Drops the previous single-observation hedge now that we have evidence from both LLM-using paths. - New "Repair grounding" section in curate-with-tidy-review.md. - cli.md regenerated with the new grounding group. - Top-level CLI help layout: orphans (config, session, tidy, grounding) moved into existing or new groups so there is no generic "Commands" bucket. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

EM101 in the two new "must not run" assertions added by the test split — assign the message to a variable first, matching the project-wide convention. C901 noqa added to three functions intentionally over the McCabe threshold; pre-existing but uncovered when the previous CI run skipped them on path filters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

- Run ruff format across grounding-touched files (10 files reformatted) - Stub llm_client inside TestClient context for the two grounding-repair endpoint tests so FastAPI startup's init_llm_client doesn't clear it in CI (no ANTHROPIC_API_KEY) - Sanitize stack-trace exposure: return generic messages from the grounding-repair endpoint instead of str(exc); keep server logs intact (CodeQL CWE-209) - Validate qualified column references against the referenced table's columns rather than the union, so renamed columns in one table aren't shadowed by their presence in another - Flag markdown table.column references where the table itself is missing instead of silently skipping them (the failure this check exists to catch) - Type-check group_columns before iterating in time_series drift check, matching load_time_series_config's contract - Quote identifiers via _quote_identifier and bound the enum-values scan with LIMIT max_per_column + 1 so a free-text column on a large table can't turn the pre-flight into a full scan - Make llm_retries semantics consistent: report the number of retries performed (attempt index on success, max_retries on give-up) - Exit with an error for --static-only when the DuckDB file is missing, instead of falling through to the LLM phase - Release state_lock around the slow LLM call in tidy_grounding_repair; re-acquire only for the small post-write critical section, with a project-switch guard so a concurrent project change can't be clobbered Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

- verify: reject --static-only + --skip-grounding-check as mutually exclusive (exit 2). The static branch is gated on `not skip_grounding_check`, so the combination silently fell through to the LLM phase, contradicting --static-only's documented semantics. - tidy_apply: move pre-apply schema snapshot (old_schema capture, state.pre_tidy_schema assignment, write_snapshot) inside state_lock, so a concurrent project switch can't desync the snapshot from the DB the apply mutates. - tidy_grounding_repair: wrap the prerequisite checks and input snapshot in `async with state.state_lock` to match what the pre-existing "before releasing the lock" comment claimed. The slow LLM call still runs outside the lock; the final mutation section already re-acquires it and guards on project_dir_snapshot. - Add a CLI test covering the new flag rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.

- tidy_grounding_repair: assign state.schema_info from the freshly introspected `tables` before rebuilding schema_map. Previously schema_text was built from `tables` while schema_map was built from the pre-introspection schema_info, so the two could desync if the DB schema ever drifted from schema_info at that point. Mirrors the assignment order in tidy_apply. - grounding_repair: loosen _FENCED_JSON so terse outputs like ```json\n{...}``` (no newline before the closing fence) match on the first try instead of falling through to the bare-JSON path — which previously left trailing backticks in the candidate string and forced a wasted retry. Whitespace on either side of the body is now optional. - Add regression tests covering both fenced-block edge cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Track per-call wall-clock time and emit token-rate in cost summary Adds elapsed_seconds wall-clock timing on every LLM API call, aggregates it through the agent loop, and surfaces it in the QUERY TOTAL log line and provenance JSON as `elapsed=X.XXs out_tps=Y.Y`. Provider-agnostic — works for Ollama (whose OpenAI-compatible endpoint omits its native timing fields), Anthropic, OpenAI, and GitHub Models alike. The reported rate is `output_tokens / elapsed_wallclock`, which includes prefill + network and is therefore a lower bound on the model's true decode rate; callers that want a clean decode rate should probe the model directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Detect and repair grounding-file drift after schema changes Adds two complementary pieces that catch the failure mode where queries.yaml / schema_description.md / time_series.yaml fall out of sync with the live database (typically after a `datasight tidy review` reshape). When that happens the LLM agent is silently being trained on wrong column names — hallucinating, refusing, or returning tables full of zeros despite a clean execution. Plan 1 — `datasight verify` static drift check - New `datasight.grounding` library: AST-walks queries.yaml with sqlglot, scans backticked identifiers in schema_description.md, and validates time_series.yaml entries against the live schema. - Suppresses false positives from prose enumeration values by auto-loading distinct values of low-cardinality VARCHAR columns. - Wired into `datasight verify` as a cheap pre-flight check. New `--static-only` flag skips the LLM phase for fast drift-only checks; `--skip-grounding-check` disables it. Plan 2 — `datasight tidy review` post-apply repair hook - New `datasight.grounding_repair` library: snapshots schema before the tidy transform, calls the configured LLM with both old and new schemas plus the drift report, validates every rewritten SQL against the live DB (retries up to 2x on failures), and returns proposed file contents without writing. - Hook in `tidy review` runs after `apply_proposal` completes: shows drift, prompts to repair, displays unified diff, prompts to apply, and writes atomically via tempfile + os.replace. - Scope limited to queries.yaml, schema_description.md, and time_series.yaml — schema.yaml and measures.yaml stay owned by tidy_review's existing update helpers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Document measured per-model memory footprints for Apple Silicon Adds an "Apple Silicon: MLX-native models" subsection to the choosing-an-llm concepts page with measured numbers from the resource monitor: - qwen2.5:7b (q4_K_M GGUF): ~2 GB resident — only option that fits 16 GB - gemma4:e2b-mlx-bf16: ~11 GB — NOT low-memory despite the "e2b" naming, because the default 256K KV-cache buffer dominates - qwen3.6:35b-a3b-coding-mxfp8: ~38 GB — best answer quality, sparse MoE works well on Apple Silicon's unified memory Tier table makes the recommendation explicit: 16 GB → qwen2.5:7b, 32 GB → qwen2.5:7b or gemma4 (tradeoff: terse vs. richer), 48 GB+ → qwen3.6:35b-A3B. Cross-references added from install.md and configuration.md so users landing on either page get pointed at the measured guidance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

daniel-thom and others added 2 commits May 10, 2026 18:40

github-advanced-security AI found potential problems May 11, 2026

View reviewed changes

Comment thread src/datasight/grounding.py Fixed

Comment thread tests/test_grounding.py Fixed

daniel-thom and others added 2 commits May 10, 2026 19:29

daniel-thom requested a review from Copilot May 11, 2026 02:01

Copilot started reviewing on behalf of daniel-thom May 11, 2026 02:01 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

Comment thread src/datasight/cli_commands/verify.py

Comment thread src/datasight/grounding.py Outdated

Comment thread src/datasight/grounding_repair.py Outdated

Comment thread src/datasight/grounding.py Outdated

Comment thread src/datasight/grounding.py Outdated

github-advanced-security AI found potential problems May 11, 2026

View reviewed changes

Comment thread src/datasight/web/app.py Fixed

github-advanced-security AI found potential problems May 11, 2026

View reviewed changes

Comment thread src/datasight/web/app.py Fixed

Comment thread src/datasight/cli.py

Comment thread src/datasight/cli_commands/grounding.py

daniel-thom requested a review from Copilot May 11, 2026 13:33

Copilot AI reviewed May 11, 2026

View reviewed changes

Comment thread src/datasight/grounding.py

Comment thread src/datasight/grounding.py Outdated

Comment thread src/datasight/grounding_repair.py

Comment thread src/datasight/web/app.py Outdated

daniel-thom requested a review from Copilot May 11, 2026 14:55

Copilot started reviewing on behalf of daniel-thom May 11, 2026 14:55 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

Comment thread src/datasight/cli_commands/verify.py

Comment thread src/datasight/web/app.py Outdated

Comment thread src/datasight/web/app.py Outdated

daniel-thom requested a review from Copilot May 11, 2026 15:14

Copilot started reviewing on behalf of daniel-thom May 11, 2026 15:15 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

Comment thread src/datasight/web/app.py

Comment thread src/datasight/grounding_repair.py

daniel-thom merged commit 08a8851 into main May 11, 2026
7 of 8 checks passed

daniel-thom deleted the grounding-drift-detection-and-repair branch May 11, 2026 17:18

Conversation

daniel-thom commented May 11, 2026

Summary

Plan 1 — datasight verify static drift check (no LLM)

Plan 2 — datasight tidy review post-apply repair hook (LLM, opt-in)

Also included: per-call token-rate reporting

Test plan

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Plan 1 — `datasight verify` static drift check (no LLM)

Plan 2 — `datasight tidy review` post-apply repair hook (LLM, opt-in)

codecov Bot commented May 11, 2026 •

edited

Loading